3  Data Structures

R uses several ‘data structures’, which are simply ways to store information of varying types. Many of the data structures R uses are defined in terms of vectors, so it is important that you are familiar with constructing and manipulating vectors in R (see S2S Lab 1).

Some of the common data structures we will become familiar with are:

  • arrays
  • matrices
  • factors
  • data frames

Each of the data structures presented here is covered using additional examples in Section 1.9 R Data Structures of Probability and Statistics with R.

3.1 Arrays

Arrays are defined as multidimensional arrangements of elements. This means that rather than storing data in a one dimensional vector, you can spread the elements of this vector across multiple dimensions.

This sounds quite complicated so let’s look at an example. First, let’s create a long vector of the numbers 1 up to 24 to give us elements to populate this array with.

vect <- 1:24

We can then use the array() function to turn the vector vect into an array. The array() function takes the following arguments:

  • data =: this is the vector of elements that we want to populate the array with.
  • dim =: this is another vector giving the number of rows first, then the number of columns and finally the number of ‘layers’.

We can turn vect into an array, called A1, using the following code.

A1 <- array(data = vect, dim = c(2, 4, 3))
A1
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8

, , 2

     [,1] [,2] [,3] [,4]
[1,]    9   11   13   15
[2,]   10   12   14   16

, , 3

     [,1] [,2] [,3] [,4]
[1,]   17   19   21   23
[2,]   18   20   22   24

Here we have populated A1 with the values 1 to 24, so that it has dimensions 2 \(\times\) 4 \(\times\) 3. This means that we have created 3 layers where each layer is a 2 \(\times\) 4 matrix.

The values from vect are entered into A1 going down the columns first, then moving from left to right before moving onto the next layer. This order of entering elements is called column-major order, since columns are filled in first.
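One way to check the column-major fill is to pick out individual elements using [row, column, layer]. The following is a minimal sketch that rebuilds A1 so it can be run on its own.

```r
# Rebuild the array from the example above
vect <- 1:24
A1 <- array(data = vect, dim = c(2, 4, 3))

dim(A1)        # 2 4 3: rows, columns, layers

# Element in row 2, column 3 of layer 1
A1[2, 3, 1]    # 6

# The first element of layer 2 follows the last element of layer 1
A1[1, 1, 2]    # 9

# The whole of layer 3, which is a 2 x 4 matrix
A1[, , 3]
```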


You can learn more about creating arrays in Section 1.9.1 Arrays and Matrices of Probability and Statistics with R.

3.2 Matrices

Creating matrices

Matrices can be thought of as two-dimensional arrays, i.e. they have no third dimension saying how many layers they should contain. Matrices can therefore also be created using the array() function, by providing the dim = argument with a vector of length two. This vector then corresponds to the number of rows and columns, respectively, that the matrix has.

To create a matrix called M1, which is populated with the elements from vect and has 6 rows and 4 columns, we can use the following code.

M1 <- array(data = vect, dim = c(6, 4))
M1
     [,1] [,2] [,3] [,4]
[1,]    1    7   13   19
[2,]    2    8   14   20
[3,]    3    9   15   21
[4,]    4   10   16   22
[5,]    5   11   17   23
[6,]    6   12   18   24

Since M1 has dimensions 6 \(\times\) 4, it contains 24 elements - the same number of elements as is in the vector vect. If instead we had defined dimensions which gave fewer than 24 elements for M1, then array() would go through and fill in the elements of M1 with the elements of vect in column-major order, until there were no spaces left to fill. This would mean that not all the values from vect would appear in M1.

array(data = vect, dim = c(6, 2))
     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9
[4,]    4   10
[5,]    5   11
[6,]    6   12

If we had defined the dimensions so that M1 contained more than 24 elements, then the values of vect would be repeated for as long as necessary until all of the elements of M1 have a value.

array(data = vect, dim = c(6, 5))
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    7   13   19    1
[2,]    2    8   14   20    2
[3,]    3    9   15   21    3
[4,]    4   10   16   22    4
[5,]    5   11   17   23    5
[6,]    6   12   18   24    6
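The two behaviours described above can be checked directly. This is a small sketch; the names small and big are used only for illustration.

```r
vect <- 1:24

# Fewer cells than values: only the first 12 values of vect are used
small <- array(data = vect, dim = c(6, 2))
identical(as.vector(small), vect[1:12])   # TRUE

# More cells than values: the values start again from the beginning
big <- array(data = vect, dim = c(6, 5))
identical(big[, 5], vect[1:6])            # TRUE
```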

Task

Use the array() function and the letters vector to create a 5 \(\times\) 5 matrix containing the letters of the alphabet in column-major order, up to “y”.

It is also possible to create matrices using the matrix() function. This has the advantage of allowing you to specify whether the elements should be filled in using column-major order or row-major order (where the elements are filled in from left to right along rows and then from top to bottom). The arguments that the matrix() function can be given are:

  • data =: this is the vector of elements that we want to fill in the matrix with.
  • nrow =: this is the number of rows the matrix should contain.
  • ncol =: this is the number of columns the matrix should contain.
  • byrow =: this takes values TRUE or FALSE and states whether the elements should be entered in row-major order (TRUE) or column-major order (FALSE). By default, the value is FALSE and elements will be entered in column-major order if you miss out this argument.

Only one of nrow = or ncol = needs to be included in the matrix() function, because R will automatically calculate the value of the argument that was not specified, based on the length of the vector given to data =.
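As a quick sketch of this, we can give matrix() only one of the two arguments and use dim() to see what R has worked out for the other.

```r
vect <- 1:24

# Only nrow given: 24 / 6 = 4 columns are inferred
dim(matrix(data = vect, nrow = 6))   # 6 4

# Only ncol given: 24 / 8 = 3 rows are inferred
dim(matrix(data = vect, ncol = 8))   # 3 8
```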

We can create the same matrix as M1 using the matrix() function. Let’s call it M2.

M2 <- matrix(data = vect, ncol = 4, byrow = FALSE)
M2
     [,1] [,2] [,3] [,4]
[1,]    1    7   13   19
[2,]    2    8   14   20
[3,]    3    9   15   21
[4,]    4   10   16   22
[5,]    5   11   17   23
[6,]    6   12   18   24

If we wanted to fill in the elements in row-major order, then we could instead use the following code.

M3 <- matrix(data = vect, nrow = 6, byrow = TRUE)
M3
     [,1] [,2] [,3] [,4]
[1,]    1    2    3    4
[2,]    5    6    7    8
[3,]    9   10   11   12
[4,]   13   14   15   16
[5,]   17   18   19   20
[6,]   21   22   23   24
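To see how the two filling orders relate, the sketch below rebuilds M2 and M3 and compares them. The comparison with t(), the base R transpose function, is an extra check that is not part of the original example.

```r
vect <- 1:24

M2 <- matrix(data = vect, ncol = 4, byrow = FALSE)  # column-major order
M3 <- matrix(data = vect, nrow = 6, byrow = TRUE)   # row-major order

# Same dimensions, but the elements are arranged differently
identical(dim(M2), dim(M3))   # TRUE
identical(M2, M3)             # FALSE

# Filling a 6 x 4 matrix in row-major order gives the transpose of a
# 4 x 6 matrix filled in column-major order from the same vector
identical(M3, t(matrix(data = vect, nrow = 4)))   # TRUE
```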

Task

Create a 5 \(\times\) 5 matrix containing the first 25 letters of the alphabet using matrix(). Fill in the elements in row-major order.

Naming rows and columns

It is possible to name the rows and columns of a matrix. This is useful if the matrix holds data corresponding to different variables that you want to label. For example, we might want to represent the following table, showing the number of births recorded in four different cities in Scotland in the years 2017, 2018 and 2019, as a matrix.

Table 3.1: Recorded births in Scottish cities in 2017, 2018, 2019.
City       2017  2018  2019
Edinburgh  5033  4899  4683
Glasgow    6852  6548  6553
Aberdeen   2402  2337  2260
Dundee     1493  1488  1417

We can first of all create a matrix births containing these values and then we can name the rows and columns.

data <- c(5033, 4899, 4683,
          6852, 6548, 6553, 
          2402, 2337, 2260, 
          1493, 1488, 1417)

births <- matrix(data = data, nrow = 4, byrow = TRUE)

cities <- c("Edinburgh", "Glasgow", "Aberdeen", "Dundee")
years <- c("2017", "2018", "2019")

dimnames(births) <- list(cities, years)
births
          2017 2018 2019
Edinburgh 5033 4899 4683
Glasgow   6852 6548 6553
Aberdeen  2402 2337 2260
Dundee    1493 1488 1417

Here, we have used the function dimnames(). By itself, dimnames() will extract the row and column names of a matrix, but we can also set these names by using the assignment operator <-. We have put a list containing the two vectors on the right-hand side of the <- operator, with the row names first and the column names second, meaning we want the row and column names to be these two vectors (see Appendix C for more details on the list() function).
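As a short sketch of the extracting side, the code below rebuilds births and then reads the names back. rownames() and colnames() are base R functions that are not used elsewhere in this section; they return one set of names at a time.

```r
births <- matrix(data = c(5033, 4899, 4683,
                          6852, 6548, 6553,
                          2402, 2337, 2260,
                          1493, 1488, 1417),
                 nrow = 4, byrow = TRUE)
dimnames(births) <- list(c("Edinburgh", "Glasgow", "Aberdeen", "Dundee"),
                         c("2017", "2018", "2019"))

# dimnames() on its own returns a list: row names first, then column names
dimnames(births)

# rownames() and colnames() return each set of names separately
rownames(births)   # "Edinburgh" "Glasgow" "Aberdeen" "Dundee"
colnames(births)   # "2017" "2018" "2019"
```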

We can now use either the row/column names or the row/column numbers to extract particular elements from the matrix. We do this using square brackets, [ ], similar to with vectors, but we now need to specify the row and column we are interested in.

For example, if we wanted to extract the number of births in Dundee in 2017, we can run either of the following lines of code.

births["Dundee", "2017"]
[1] 1493
births[4, 1]
[1] 1493

Note that when we use the row/column names, they are in quotation marks because they are saved as character vectors in R. It is important that the row you are interested in is stated first in [ ], and then the column.

Task

What is the code you would use to show the number of births in Glasgow in all 3 years?

See Appendix A to learn how to find the dimensions of whatever you extract from a matrix.

Calculating statistics

We can use the function apply() to apply another function, for example the mean or standard deviation, across the rows or columns of a matrix. The arguments that can be given to apply() include:

  • X =: this is the matrix (or array) we want to apply the function to.
  • MARGIN =: this tells R whether we want to apply the function to the rows or the columns. A value of 1 means the function will be applied to each row and 2 means the function will be applied to each column.
  • FUN =: this is the function we want to apply. It can be things like the mean (mean), median (median), or standard deviation (sd).

For example, if we wanted to know the mean births for each city across the three years, we can use the following code.

apply(X = births, MARGIN = 1, FUN = mean)
Edinburgh   Glasgow  Aberdeen    Dundee 
 4871.667  6651.000  2333.000  1466.000 
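The same idea works down the columns. The sketch below rebuilds births and uses MARGIN = 2 with sum to total the births in each year. The comparison with rowMeans(), a base R shortcut not used in the original, is included only as a check.

```r
births <- matrix(data = c(5033, 4899, 4683,
                          6852, 6548, 6553,
                          2402, 2337, 2260,
                          1493, 1488, 1417),
                 nrow = 4, byrow = TRUE)
dimnames(births) <- list(c("Edinburgh", "Glasgow", "Aberdeen", "Dundee"),
                         c("2017", "2018", "2019"))

# MARGIN = 2 applies the function to each column (year)
apply(X = births, MARGIN = 2, FUN = sum)
#  2017  2018  2019
# 15780 15272 14913

# rowMeans() gives the same result as apply() with MARGIN = 1 and FUN = mean
all.equal(rowMeans(births), apply(X = births, MARGIN = 1, FUN = mean))   # TRUE
```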

Task

What is the standard deviation for the number of births in 2019?

3.3 Factors

Factors are similar to vectors in R, but they carry additional information and are used to store categorical data, for example someone’s gender or marital status. As well as the values themselves, a factor records the “levels” of the categorical variable, i.e. the categories that each value stored in the vector corresponds to.

For example, suppose you are interested in the qualification level of several university alumni. You might use a simple encoding of 1=“Bachelor’s degree”, 2=“Master’s degree” and 3=“PhD” to record these data.

This might give us data that looks like the following vector degree1.

degree1 <- c(1, 1, 2, 1, 3)
degree1
[1] 1 1 2 1 3

By itself, degree1 is not very informative about what level of degree each student was awarded. We can fill in the rest of this information by changing degree1 from a vector to a factor.

To create a factor we use the function factor(). This function can be given the following arguments:

  • x =: the vector of data which we want to define categories for.
  • levels =: this is a vector of all possible values that the elements in x can take.
  • labels =: this is a vector containing the names of each level of the category.

For example, to convert the vector degree1 into a factor called degree_factor1 we use:

degree_factor1 <- factor(x = degree1, levels = 1:3,
                         labels = c("Bachelor's", "Master's", "PhD"))
degree_factor1
[1] Bachelor's Bachelor's Master's   Bachelor's PhD       
Levels: Bachelor's Master's PhD
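To see what the factor stores, we can look at its levels and at the position of each value's level. The following is a minimal sketch using the base R functions levels(), as.integer() and table().

```r
degree1 <- c(1, 1, 2, 1, 3)
degree_factor1 <- factor(x = degree1, levels = 1:3,
                         labels = c("Bachelor's", "Master's", "PhD"))

levels(degree_factor1)       # "Bachelor's" "Master's" "PhD"
as.integer(degree_factor1)   # 1 1 2 1 3 (the position of each value's level)
table(degree_factor1)        # counts: 3 Bachelor's, 1 Master's, 1 PhD
```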

If instead the data had been recorded as a character vector stating the level of degree awarded, we can still turn this into a factor so that R knows this is categorical data and there are only three levels we are interested in.

degree2 <- c("Bachelor's", "Bachelor's", "Master's", "Bachelor's", "PhD" )

degree_factor2 <- factor(x = degree2,
                         levels = c("Bachelor's", "Master's", "PhD"))
degree_factor2
[1] Bachelor's Bachelor's Master's   Bachelor's PhD       
Levels: Bachelor's Master's PhD

You can also change the labels of the levels used within a pre-existing factor using the levels() function.

levels(degree_factor2) <- c("BSc", "MSc", "PhD")
degree_factor2
[1] BSc BSc MSc BSc PhD
Levels: BSc MSc PhD

Task

The results from a survey asking students whether statistics is the best subject are shown below. They were given a choice of “Agree”, “Disagree” and “Unsure”.

Student    Answer
Student 1  Agree
Student 2  Agree
Student 3  Agree
Student 4  Unsure
Student 5  Disagree

Create and print a factor, called survey, which contains the answers of these five students as well as the levels of response they could have given.


Creating factors in R is covered in Section 1.9.3 Factors of Probability and Statistics with R.

3.4 Data frames

Data frames in R are very similar to matrices. The key difference however is that whilst all elements in a matrix must be of the same “mode” (e.g. numeric, character, logical), each column in a data frame can be of a different mode. If you needed to store a numeric vector, a logical vector and a character vector that all relate to the same subjects for example, then a data frame is the way to do this.

Data frames are a very common type of data structure used within R. Most of the data you will see saved in packages, or the data you will use for fitting statistical models will be saved in a data frame.

To create a data frame, we can use the function data.frame(). The only arguments needed are the pre-existing vectors that you want to save within the data frame. These all need to be of the same length. Some additional arguments include:

  • stringsAsFactors =: this takes the values TRUE or FALSE and tells R whether any character vectors should be turned into factors. If this argument is excluded, R (version 4.0.0 and later) takes the default value to be FALSE, so character vectors will be kept as they are.
  • row.names =: this can be a vector of names you wish to use for the rows of the data frame. By default, R will just number the rows starting from 1.

We can create a data frame storing information about students’ performance in a course using the following code.

percentage <- c(84, 76, 90, 53, 6, 67)
grade <- c("A", "A", "A", "C", "H", "B")
pass <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)

performance <- data.frame(percentage, grade, pass)
performance
  percentage grade  pass
1         84     A  TRUE
2         76     A  TRUE
3         90     A  TRUE
4         53     C  TRUE
5          6     H FALSE
6         67     B  TRUE

We can see that within the data frame performance, percentage is a numeric vector, grade is a character vector and pass is a logical vector by using the function str().

str(performance)
'data.frame':   6 obs. of  3 variables:
 $ percentage: num  84 76 90 53 6 67
 $ grade     : chr  "A" "A" "A" "C" ...
 $ pass      : logi  TRUE TRUE TRUE TRUE FALSE TRUE

We could change grade to be a factor by adding stringsAsFactors = TRUE within the data.frame() function. We can also create a vector of students’ IDs and use this to name the rows of the data frame in the following code.

ids <- c("ST002", "ST014", "ST089", "ST060", "ST034", "ST056")

performance <- data.frame(percentage, grade, pass,
                          stringsAsFactors = TRUE, row.names = ids)
performance
      percentage grade  pass
ST002         84     A  TRUE
ST014         76     A  TRUE
ST089         90     A  TRUE
ST060         53     C  TRUE
ST034          6     H FALSE
ST056         67     B  TRUE

Now if we use the str() function, we can see that grade is treated as a factor.

str(performance)
'data.frame':   6 obs. of  3 variables:
 $ percentage: num  84 76 90 53 6 67
 $ grade     : Factor w/ 4 levels "A","B","C","H": 1 1 1 3 4 2
 $ pass      : logi  TRUE TRUE TRUE TRUE FALSE TRUE
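The effect of this argument can also be checked without reading through the str() output. In the sketch below, the data frame names as_chr and as_fct are used only for illustration, and sapply() is a base R function that applies class() to each column in turn.

```r
grade <- c("A", "A", "A", "C", "H", "B")
pass <- c(TRUE, TRUE, TRUE, TRUE, FALSE, TRUE)

as_chr <- data.frame(grade, pass, stringsAsFactors = FALSE)
as_fct <- data.frame(grade, pass, stringsAsFactors = TRUE)

# Class of each column when character vectors are kept as they are
sapply(as_chr, class)   # grade: "character", pass: "logical"

# Only the character column is converted; the logical column is unchanged
sapply(as_fct, class)   # grade: "factor", pass: "logical"
```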

Extracting elements

Elements from data frames can be extracted in a couple of ways. For example, we can use square brackets [ ], or we can use the dollar sign operator $.

For example, if we wanted to extract just the vector pass from the data frame performance, we can use any of the following lines of code.

performance[ , "pass"]
[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
performance[ , 3]
[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
performance$pass
[1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

When we use square brackets, we also need to specify which rows we want to extract as the first entry (before the comma) within the square brackets. In the code above, we haven’t specified any rows, so R shows us all of the rows from performance.
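The sketch below shows how the entries before and after the comma work together. It uses a small hypothetical data frame called scores, which is not part of the original example.

```r
# Hypothetical data frame, used only for this illustration
scores <- data.frame(mark = c(71, 48, 65),
                     passed = c(TRUE, FALSE, TRUE),
                     row.names = c("S1", "S2", "S3"))

scores["S2", ]          # one row, all columns (still a data frame)
scores[ , "mark"]       # one column, all rows (a vector)
scores["S3", "mark"]    # a single element: 65
scores[c(1, 3), ]       # rows 1 and 3, all columns
```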

Task

Write code to extract only the percentage and the associated grade for the student with ID ST014.

An alternative way to easily extract columns from a data frame is to use the attach() function. The only argument needed here is the data frame you want to attach to something called the ‘search path’ in R. This just means that you no longer need to type in the name of the data frame to access its columns.

Before we do this, we are going to run the following code to remove the original vectors we created from the Environment tab using the rm() function. We do this so that R doesn’t just show us these pre-existing vectors directly, but instead looks within the data frame performance.

rm("percentage", "grade", "pass")

attach(performance)
grade
[1] A A A C H B
Levels: A B C H

The attach() function is useful if you are going to be using the same data frame over and over, but it is good practice to ‘detach’ it using the detach() function once you no longer need the data frame.

detach(performance)
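To see what attaching and detaching does to the search path, we can use the base R function search(), which lists everything currently on it. This is a minimal sketch using a hypothetical data frame called scores rather than performance.

```r
# Hypothetical data frame, used only for this illustration
scores <- data.frame(mark = c(71, 48, 65), passed = c(TRUE, FALSE, TRUE))

attach(scores)
"scores" %in% search()   # TRUE: the data frame is on the search path
mean(mark)               # the column can be used by name alone
detach(scores)

"scores" %in% search()   # FALSE: the data frame has been removed again
```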

Using packages

When conducting statistical analysis, you won’t often need to create your own data frame of information from scratch - it will most likely already exist in some format somewhere! One place where data might be stored is in the packages you can install and load into R.

We have already installed and loaded the package PASWR2 in Section 2. We can now see a list of the data frames stored in the PASWR2 package by using the following code.

data(package = "PASWR2")

We can see more information about the data stored in any of these data sets using the help() function. For example, if we wanted to know what the data in the data frame RAT related to, we could use the following code.

help("RAT")

To view an extract of this data frame, we can use the function head() and provide as an argument the name of the data frame. This will show us the first 6 rows of a data frame by default.

head(RAT)
  survival.time
1           152
2           152
3           115
4           109
5           137
6            88

To save a data frame from a package in our own Environment tab, we use the function data(). This will read in the data frame and means we can use it as we would any other data frame that we had created ourselves.

data("RAT")

Task

The package PASWR2 contains a data set called WAIT. What do the wait times saved in this data set relate to?

Task

Write code to first view the top 5 rows of the data frame WAIT and then load it into your Environment.


All of this information is covered with further examples in Section 1.9.5 Data Frames of Probability and Statistics with R.